arXiv:1802.04289v2 [cs.AI] 18 Feb 2018


Deep Neural Networks for Bot Detection 


Sneha Kudugunta 
Indian Institute of Technology, Hyderabad 
Hyderabad, India 
cs14btech11020@iith.ac.in 


ABSTRACT 


The problem of detecting bots, automated social media accounts
governed by software but disguised as human users, has strong
implications. For example, bots have been used to sway political
elections by distorting online discourse, to manipulate the stock
market, or to push anti-vaccine conspiracy theories that caused
health epidemics. Most techniques proposed to date detect bots
at the account level, by processing large amounts of social media
posts and leveraging information from network structure, temporal
dynamics, sentiment analysis, etc. In this paper, we propose a
deep neural network based on contextual long short-term mem- 
ory (LSTM) architecture that exploits both content and metadata 
to detect bots at the tweet level: contextual features are extracted 
from user metadata and fed as auxiliary input to LSTM deep nets 
processing the tweet text. Another contribution that we make is 
proposing a technique based on synthetic minority oversampling 
to generate a large labeled dataset, suitable for deep nets training, 
from a minimal amount of labeled data (roughly 3,000 examples of 
sophisticated Twitter bots). We demonstrate that, from just one sin- 
gle tweet, our architecture can achieve high classification accuracy 
(AUC > 96%) in separating bots from humans. We apply the same 
architecture to account-level bot detection, achieving nearly perfect
classification accuracy (AUC > 99%). Our system outperforms the
previous state of the art while leveraging a small, interpretable
set of features and requiring minimal training data.


CCS CONCEPTS 


• Networks → Social media networks; • Information systems
→ Web and social media search; • Computer systems organization
→ Neural networks;


1 INTRODUCTION 


During the past decade, social media like Twitter and Facebook 
emerged as a widespread tool for massive-scale and real-time com- 
munication. These platforms have been promptly praised by some 
researchers for their power to democratize discussions [39], for 
example by allowing citizens of countries with oppressing regimes 
to openly discuss social and political issues. However, due to many 
recent reports of social media manipulation, including political 
propaganda, extremism, disinformation, etc., concerns about their 
abuse are mounting [22]. 

One example of social media manipulation is the use of bots (a.k.a. 
social bots, or sybil accounts), user accounts controlled by software 
algorithms rather than human users. Bots have been extensively 
used for disingenuous purposes, ranging from swaying political 
opinions to perpetuating scams. Existing social media bots vary 
in sophistication. Some bots are very simple and merely retweet 
posts of interest, whereas others are more sophisticated and have
the capability to even interact with human users. 

The challenge of bot detection has thus been taken up by our
research community, and different approaches have been proposed to
detect these types of social media bots. Supervised learning in
particular has exhibited promising results: examples of activity of
human users and bots, labeled as such, can be fed to machine learning
algorithms; trained models are then used to classify unseen
accounts, leveraging data obtained, e.g., through the Twitter API,
which may help determine the nature of suspicious accounts.
Alternatives based on unsupervised learning aim to identify large-scale
behavioral anomalies and associate them with bot accounts.

However, most, if not all, of the successful methods introduced 
so far detect bots at the account level. This means that, given a 
record of activity (e.g., a few hundred tweets posted by a user), 
the algorithm would determine whether the scrutinized account is 
likely a bot or not. These approaches tend to focus on the overall 
activity of the account, e.g., the content and sentiment of user posts, 
the network structure, and the temporal activity patterns. 

Though reasonably successful, account-level bot detection approaches
are expensive, as they require significant amounts of data
from each user under scrutiny, as well as large labeled datasets for
training purposes. In contrast, most available labeled
datasets have at most a few hundred examples of tweets posted by
a few thousand bots. For a comprehensive survey of bot detection
methods, we direct the reader to [20].


1.1 Research questions 


These fundamental limitations pose two research questions that
we try to address in this paper:


RQ1: Is it possible to accurately predict whether a given tweet has 
been posted by a bot or human account? 

RQ2: Is it possible to enhance existing labeled datasets to produce 
more examples of bot and human accounts without the ad- 
ditional (and very expensive) data collection and annotation 
steps? 


1.2 Contributions of this work 
The contributions we provide here aim to address these challenges: 


(1) We advance the problem of classifying individual social 
media accounts from single observations, i.e., determining 
whether a single tweet comes from a Twitter bot or from a 
human user. We demonstrate that tweet-level bot detection
is possible and can be very accurate: by exploiting both textual
features and tweet metadata, we detect bots from a single
tweet and even exceed the performance of earlier works that
make use of a given user’s entire profile and recent posting
history.


(2) As a technical contribution, we introduce the concept of a 
Contextual LSTM (long short-term memory) deep neural net- 
work [23, 30], an architecture that takes both the tweet text 
and the tweet metadata as an input. Related architectures
that use side-information to enhance recurrent model representations
have been suggested by some authors working
primarily on language models [4, 29, 41] but, to the best of
our knowledge, have never been used in the context of social
media classification. The proposed architecture allows us to
reach state-of-the-art performance in bot detection (over 96% 
AUC scores). 

(3) Finally, we introduce a technique based on the usage of syn- 
thetic minority oversampling [9] to enhance existing datasets 
by generating additional labeled examples. This will allow 
us to achieve near perfect classification performance on the 
account-level bot detection task, by leveraging only a mini- 
mal number of features and very small training datasets. 


1.3 Impact of this work 


A successful tweet-level bot detection approach would potentially
overcome the limitations presented above, namely the need for
computationally expensive models that require large numbers of
features, large labeled datasets for training purposes, and access to
the recent activity history of the user profile under scrutiny.

Given the same pool of users, a tweet-based bot detection ap- 
proach would have significantly more labeled examples to exploit. 
For example, in the dataset we use in this paper (discussed in the 
next section), we have labels for 3,474 human users, which overall 
generated 8,377,522 tweets; we also have labels for 4,912 social bots, 
which generated 3,457,344 tweets. 

A tweet-level detection approach would be capable of leveraging
nearly 12 million labeled datapoints, whereas an account-level
detection system would only be able to exploit about eight thousand
examples of bots and human accounts, using those millions
of tweets merely to learn patterns associated with the originating
accounts.

Shifting to tweet-level bot detection, and thus having training 
data orders of magnitude larger than otherwise, makes the problem 
of bot detection far more amenable to the usage of deep learning 
models. Such techniques benefit greatly from vast amounts of la- 
beled examples and show extremely high performance in many 
contexts where such large annotated datasets are available [36], 
from image classification [35] to mastering games [44, 45, 51]. 

Traditional deep learning techniques used for text classification
purposes (as well as in the broader context of language models) rely
solely on textual features (e.g., characters or n-grams) [32]. A
straightforward application of such techniques to tweet-level
bot detection could be based exclusively on tweet texts as inputs
for the deep neural network of choice. However, prior results in
bot detection suggested that tweet text alone is not highly predictive
of bot accounts [20]. Exploiting additional features, such as
account metadata, network structure information, or temporal
activity patterns, has been found to yield more robust and accurate
results.

To draw a parallel with recent advances in natural language
processing (NLP) powered by deep learning, we here propose a
novel Contextual LSTM architecture that utilizes both tweet text and
account metadata (which are provided by the Twitter API alongside
the tweet itself, and do not require extra data collection steps)
to yield a high classification accuracy.

We hypothesize that the proposed model can be used in other 
deep learning applications where multimodal data are available for 
such types of classification tasks. 

A successful tweet-level bot detection system also has interesting 
practical implications. 


• Identifying instances of large numbers of bot-generated
tweets coming from a single account would enable us to
identify bots that have identifiably bot-generated tweets
interspersed with human-generated tweets, whether manually
generated or retweeted.

• Since tweets are often viewed as a part of a feed containing
both genuine and bot-generated tweets, it is potentially useful
to be able to flag isolated tweets as possibly bot-generated.


2 DATASET 


The dataset used in our work is the one presented in [14], which
contains an entirely new breed of social bots. We use a mixture
of the groups genuine accounts, social spambots #1, social
spambots #2 and social spambots #3.

Together, these subsets contain 8,386 user accounts and
11,834,866 tweets to train on. A group-wise breakdown
is given in Table 1.


Dataset            | Accounts | Tweets
genuine accounts   | 3,474    | 8,377,522
social spambots #1 | 991      | 1,610,176
social spambots #2 | 3,457    | 428,542
social spambots #3 | 464      | 1,418,626


Table 1: Breakdown of the dataset used to train our models. 
The dataset was obtained from Cresci and collaborators [14]. 
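
As a quick sanity check on the figures quoted in Sections 1.3 and 2, the totals can be recomputed from the rows of Table 1. The sketch below only re-adds the published counts; the variable names are ours and purely illustrative.

```python
# Account and tweet counts copied from Table 1.
groups = {
    "genuine accounts": (3_474, 8_377_522),
    "social spambots #1": (991, 1_610_176),
    "social spambots #2": (3_457, 428_542),
    "social spambots #3": (464, 1_418_626),
}

total_accounts = sum(accounts for accounts, _ in groups.values())
total_tweets = sum(tweets for _, tweets in groups.values())

# Everything that is not a genuine account is a bot.
bot_accounts = total_accounts - groups["genuine accounts"][0]
bot_tweets = total_tweets - groups["genuine accounts"][1]

# Share of the bot class at each granularity.
bot_share_accounts = bot_accounts / total_accounts
bot_share_tweets = bot_tweets / total_tweets

print(total_accounts, total_tweets)  # 8386 11834866
print(bot_accounts, bot_tweets)      # 4912 3457344
```

Note that the class balance differs by granularity: bots make up roughly 59% of the accounts but only about 29% of the tweets, so the minority class is not the same in the two tasks.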


Though many established techniques use a large number of
features (for example, [15] uses over 1,500 features), recent research
[19, 20] shows that similarly high performance can be obtained by
using a minimal number of features. For account-level bot detection,
we use the following features:


• Statuses Count
• Followers Count
• Friends Count
• Favorites Count
• Listed Count
• Default Profile
• Geo Enabled
• Profile Uses Background Image
• Verified
• Protected


Similarly, for tweet-level classification we use only 6 features,
apart from the tweet content itself:

• Retweet Count
• Reply Count
• Favorite Count
• Number of Hashtags
• Number of URLs
• Number of Mentions


The choice of limiting the size of the feature set is motivated by 
two important reasons: 


• Model efficiency: A reduced set of features yields very
efficient models that can be trained faster and are less prone
to overfitting, which is a common issue in social media data
mining due to the presence of outliers.

• Interpretability: A limited set of features with an obvious
meaning, like the ones provided by account metadata, makes it
possible to produce interpretable models. This is a very important
point, especially when combined with deep learning strategies
that are notoriously hard to interpret.


Our choice stands in contrast to that of many feature-based systems,
which are designed to leverage hundreds, or even thousands,
of features, but whose computational efficiency is suboptimal and
whose interpretability is challenging.


3 METHODS 


In this study we face two classification tasks: account-level bot de- 
tection and tweet-level bot detection. In the following, we describe 
the methodological approaches that we adopted to address these 
two challenges. 


3.1 Task 1: Account-level Classification 


Previous work on account-level classification has found that user 
metadata tends to be the best predictor for bot detection [19]. As 
presented in Section 2, we use a minimal number of highly in- 
terpretable features that require little to no preprocessing. This 
enabled us to use a multitude of out-of-the-box classical machine 
learning approaches as listed in Section 4. 

We found that most of these approaches sufficed, with most
of them exceeding an AUC of 90%. Our most successful approach was
a Random Forest classifier, giving an AUC of 98.45%.

However, significant performance gains were observed on bal- 
ancing the dataset with oversampling techniques, specifically the 
synthetic minority oversampling technique (SMOTE) [9]. 

The SMOTE algorithm generates synthetic samples in the feature
space of the minority class (i.e., the class that has fewer
labeled datapoints), and is a powerful method that has been used
successfully across many domains [28]. Specifically, we combine
SMOTE with two undersampling techniques, which serve as data
enhancement steps to remove any bias introduced by
oversampling: (1) Edited Nearest Neighbors (ENN) [58] and
(2) Tomek Links [55].
These two combinations have been found to give excellent results
on imbalanced data [5].
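
To make the oversampling step concrete, the following is a minimal, standard-library sketch of the interpolation idea behind SMOTE: a synthetic sample is placed at a random position on the segment joining a minority sample and one of its nearest minority neighbors. It is a simplification for illustration, not the implementation used in our experiments; the function name `smote`, the parameters, and the toy points are all hypothetical, and base samples are drawn at random rather than visited in turn.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two features (hypothetical values).
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority_points, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.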

Though combining SMOTE and undersampling through Tomek
Links (SMOTOMEK) does not improve results by much, as we
will discuss later, significant improvement is seen by combining
SMOTE and undersampling through ENN (SMOTENN) across all
models. With SMOTENN, near-perfect classification accuracy is
achieved, the best model being an AdaBoost classifier at 99.81%
accuracy.
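
The ENN cleaning step can be sketched in a similar way. In this simplified version, a sample is kept only if its label agrees with the majority label of its k nearest neighbors; the function name, the choice k = 3, and the toy data are assumptions made for illustration, and library implementations offer further options.

```python
import math
from collections import Counter


def edited_nearest_neighbors(points, labels, k=3):
    """Keep samples whose label matches the majority label of their k nearest neighbors."""
    kept_points, kept_labels = [], []
    for i, (point, label) in enumerate(zip(points, labels)):
        # Indices of the k nearest other samples.
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        majority_label, _ = Counter(labels[j] for j in nearest).most_common(1)[0]
        if majority_label == label:
            kept_points.append(point)
            kept_labels.append(label)
    return kept_points, kept_labels


# Toy data: one "bot" sample sits inside the "human" cluster (hypothetical values).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["human", "human", "human", "human", "bot",
          "bot", "bot", "bot", "bot"]
clean_points, clean_labels = edited_nearest_neighbors(points, labels, k=3)
```

In this toy example only the isolated bot sample at (0.05, 0.05) is removed, since all of its nearest neighbors are human.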


Our results will suggest that near-perfect bot detection
accuracy at the account level can be achieved even without complex
deep learning architectures. The same does not hold for the
next task, i.e., that of detecting bots from single observations.


3.2 Task 2: Tweet-level Classification 


We here introduce the problem of determining from a single data- 
point (e.g., a single tweet) whether the user in question is a bot or 
not. 

The approaches that do use tweet content tend to use feature 
engineering and specific characteristics of the tweets, such as those 
extracted via Parts-of-Speech tagging, counting the number of 
hashtags or measuring tweet dissimilarity [20]. For bot detection in 
particular, there is a dearth of convincingly successful approaches 
based on using single observations. 

As a baseline, we attempted to use an approach similar to that
of Section 3.1 by exploiting just the features described in Section
2. Without oversampling, none of our methods exceeded
an AUC of 78%. With SMOTE oversampling followed
by ENN undersampling, our results improve,
as we will discuss in Section 4.2.

Many of the state-of-the-art techniques from NLP using textual
content tend to focus on the average pattern of tweeting style.
Mostly relying on traditional data mining and NLP techniques, these
methods have proven ineffective against more advanced social
bots [14]. To overcome the limitations of traditional techniques,
we use Long Short Term Memory (LSTM) models [30], a superior
variant of Recurrent Neural Networks (RNNs) [33]. RNNs and their
variants have been found to be effective for NLP tasks, given
their ability to learn relationships in sequential data [24].

To transform the tweets into a form suitable for LSTMs, as an
embedding we use a pre-trained set of Global Vectors for Word
Representation (GloVE) meant for Twitter data [47]. GloVE is a global
log-bilinear regression model that combines global matrix
factorization and local context window methods to effectively learn
the substructure of natural language by training on word
co-occurrence. Prior to using this embedding, we preprocess our
tweets by tokenizing them using the methods suggested by the creators
of GloVE, as follows.


3.2.1 Preprocessing Data. Prior to training the LSTM on the 
tweets, we preprocess the data by forming a string of tokens from 
each tweet. 


• We replace occurrences of hashtags, URLs, numbers and
user mentions with the tags “<hashtag>”, “<url>”, “<number>”,
or “<user>”.

• Similarly, most common emojis are replaced with “<smile>”,
“<heart>”, “<lolface>”, “<neutralface>” or “<angryface>”,
depending on the specific emoji.

• For words written in upper case letters or for words
containing more than 2 repeated letters, a tag denoting that is
placed after the occurrence of the word. For example, the
word “HAPPY” would be replaced by two tokens, “happy”
and “<allcaps>”.

• All tokens are converted to lower case.
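
The steps above can be sketched with a few regular expressions. This is a simplified illustration and not the exact script distributed with GloVE: the function name `tokenize_tweet`, the regular expressions, the two-entry emoticon table, and the “<elong>” tag used here for words with repeated letters are assumptions of ours.

```python
import re

# Tiny illustrative subset of emoticons; a real table would be much larger.
EMOTICONS = {":)": "<smile>", ":-)": "<smile>", "<3": "<heart>"}
# A token is a placeholder tag, a word (optionally with an apostrophe), or a symbol.
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)?|[^\w\s]")


def tokenize_tweet(text):
    """Turn a raw tweet into lower-case tokens with placeholder tags."""
    # URLs go first so that digits inside them are not tagged as numbers.
    text = re.sub(r"https?://\S+", " <url> ", text)
    text = re.sub(r"@\w+", " <user> ", text)
    text = re.sub(r"#\w+", " <hashtag> ", text)
    for emoticon, tag in EMOTICONS.items():
        text = text.replace(emoticon, f" {tag} ")
    text = re.sub(r"\b\d+(?:[.,]\d+)*\b", " <number> ", text)

    tokens = []
    for token in TOKEN_RE.findall(text):
        if token.startswith("<") and token.endswith(">"):
            tokens.append(token)  # placeholder tag, keep as is
            continue
        is_allcaps = token.isalpha() and token.isupper() and len(token) > 1
        # Collapse runs of three or more identical characters ("soooo" -> "so").
        shortened = re.sub(r"(\w)\1{2,}", r"\1", token)
        tokens.append(shortened.lower())
        if is_allcaps:
            tokens.append("<allcaps>")
        if shortened != token:
            tokens.append("<elong>")
    return tokens


example_tokens = tokenize_tweet("HAPPY @bob #win 100 https://t.co/x")
# example_tokens == ['happy', '<allcaps>', '<user>', '<hashtag>', '<number>', '<url>']
```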


Then, these tokenized tweets are transformed into an embedding
using the aforementioned pre-trained GloVE model. The resulting
sequence of vectors is then fed to the LSTM that outputs a single
32-dimension vector that is then fed forward through 2 ReLU-activated
layers of size 128 and 64 to give the output. A diagram of this simple
LSTM architecture can be seen in Figure 1. It is to be noted that our
model resets its state after each input, and therefore learns only the
sequential structure within each tweet and not the sequence of
tweets.


Figure 1: Architecture of model for tweet-level bot detection
that takes only the tweet content as its input.


3.2.2 Limits of traditional LSTM deep nets. However, this approach
only uses the text content of the tweets, and does not utilize
the metadata associated with it. As observed from the results of
Table 3, which will be discussed later in detail, metadata alone do
not precisely predict the nature of a user, but they do predict it
at least weakly. In other words, metadata are weak
predictors of an account’s nature (bot or not).

Since metadata, unlike tweet text, are not sequential data that
can be fed into the LSTM along with the text embedding, we instead
propose a new Contextual LSTM architecture that can effectively
utilize both the text and the metadata, even though they are predictors
of different strengths.


3.2.3 Contextual LSTM architecture. As seen in Figure 2, our 
proposed architecture has multiple inputs and outputs. Multi-input 
recurrent models have been suggested before, mainly in language 
models: for example, some authors [4, 29, 41] adopt the idea of 
giving auxiliary inputs to either the input, hidden, or output layer 
of the recurrent model to enrich the learned representations. 
Figure 2: Architecture of the proposed contextual LSTM that
uses both tweet content and the metadata that comes with 
the tweet. 


In our work, we give auxiliary input to the output layer. Similar 
to the tweets-only model previously described, the main input is 
the tweet text that is tokenized and transformed into a set of GloVE 
vectors before feeding them into the LSTM. This again results in an 
output vector that is concatenated with our auxiliary input and then 
given as input to a 2-layer neural network with ReLU activations 
to yield the output. The exact sizes of layers are the same as those of
our previous model.
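
The concatenation step can be illustrated with a small forward pass written in plain Python. The layer sizes (a 32-dimension LSTM output, 6 tweet-level metadata features, and hidden layers of 128 and 64 units) follow the text; the random placeholder weights, the sigmoid output unit, and all function names are assumptions made for this sketch, and the LSTM itself is replaced by a stand-in vector.

```python
import math
import random

rng = random.Random(0)


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), with weights stored row-wise."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out):
    # Random placeholder weights; a trained model would learn these.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


LSTM_DIM, META_DIM = 32, 6
hidden1 = init_layer(LSTM_DIM + META_DIM, 128)
hidden2 = init_layer(128, 64)
output = init_layer(64, 1)  # sigmoid output unit (assumed)


def contextual_head(lstm_vector, metadata):
    """Concatenate the LSTM output with the metadata and classify the result."""
    merged = list(lstm_vector) + list(metadata)  # auxiliary input joins here
    h = dense(merged, *hidden1, relu)
    h = dense(h, *hidden2, relu)
    return dense(h, *output, sigmoid)[0]


lstm_vector = [rng.uniform(-1, 1) for _ in range(LSTM_DIM)]  # stand-in for the LSTM output
metadata = [0.0, 1.0, 3.0, 2.0, 1.0, 0.0]                    # six tweet-level features
p_bot = contextual_head(lstm_vector, metadata)
```

The only difference from the tweets-only model is the width of the first dense layer's input, which grows from 32 to 38 to accommodate the metadata.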

As a regularization mechanism, we introduce an auxiliary output 
after the LSTM output whose target is also the classification label. 
Such a mechanism has been used successfully before in [54]. The 
total loss is a weighted average of the auxiliary output loss and the 
main output loss (0.2:0.8 in this case). 
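
The weighting of the two losses can be written out as follows. We assume binary cross-entropy for both outputs, which the text above does not state explicitly, and we read the 0.2:0.8 ratio as a weight of 0.2 on the auxiliary loss and 0.8 on the main loss; the function names are illustrative.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Cross-entropy for one example; p_pred is the predicted bot probability."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def total_loss(y_true, p_main, p_aux, w_aux=0.2, w_main=0.8):
    """Weighted combination of the auxiliary and main output losses."""
    return (w_aux * binary_cross_entropy(y_true, p_aux)
            + w_main * binary_cross_entropy(y_true, p_main))


# Example: a bot tweet (label 1) scored 0.9 by the main output and 0.6 by the auxiliary one.
loss = total_loss(y_true=1, p_main=0.9, p_aux=0.6)
```

Because the auxiliary output sees only the LSTM vector, its term encourages the text representation to be predictive on its own, independently of the metadata.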


4 RESULTS 


Here, we tabulate our results for the approaches outlined
in the previous section. The success of each method is measured
on a variety of metrics, namely precision, recall, F1-score,
accuracy and Area Under the Receiver Operating Characteristic Curve
(AUC/ROC).
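
For reference, these metrics can be computed from scratch as sketched below. We assume here that the positive class (label 1) is "bot" and that hard predictions come from a 0.5 threshold on the classifier score; both are illustrative choices, and the toy labels and scores are made up.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and scores).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike the other four metrics, AUC depends only on how the scores rank the examples and not on the chosen threshold.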


4.1 Task 1: Account-Level Classification 


In Table 2, we have tabulated results of our experiments on account- 
level classification. 

The first batch of five classifiers shows the results obtained by solely
using the data without any oversampling methods: Random Forest
Classifiers and AdaBoost Classifiers yield the best results, with an
AUC > 98%. Overall, all classifiers perform really well, suggesting
that the account-level classification task, at least for the dataset
under analysis here, is not particularly daunting.

The second batch of systems reported in Table 2 shows the results
obtained after using oversampling with SMOTE, followed by
undersampling with ENN (SMOTENN): the results are improved across
all classifiers, with the AdaBoost classifier achieving near-perfect
accuracy of 99.81%. This suggests that our strategy of synthetic
minority oversampling is really effective when dealing with this
type of account-level classification task.

The third batch of systems shows the results of oversampling
with SMOTE followed by undersampling with Tomek Links (SMOTOMEK):
with this approach the results improve as well, though not as
drastically as by using ENN.

Overall, we demonstrated the feasibility of high-accuracy account- 
level bot detection by means of unsophisticated off-the-shelf al- 
gorithms in combination with synthetic minority oversampling 
techniques to enhance training data. 


4.2 Task 2: Tweet-Level Classification 


We now illustrate the proposed tweet-level bot detection task. Table 
3 shows four batches of systems: the first three show the baseline 
models provided by the off-the-shelf classifiers, specifically in their 
naive implementation (first batch), and by using again synthetic 
minority oversampling in combination with ENN (second batch) or 
TOMEK (third batch); finally, the fourth and last batch shows the 
performance of the proposed LSTM deep architectures, showing 
different configurations. 

The first batch of systems that only uses the tweet metadata does 
not work particularly well. Without data augmentation, none of 
the results cross an AUC of 78%. 

By using synthetic minority oversampling in combination with
ENN, results drastically improve: the second batch of systems shows
that all models reach an accuracy between 90% and 92%, with an
AUC between 88% and 91%.

However, no such improvement is seen by using SMOTE in
combination with Tomek Links. Although the reasons for such a difference
with respect to the previous task are not apparent, and warrant
further investigation, we hypothesize that ENN works better because
tweets originated by bots exhibit identical or very similar metadata,
and thus a nearest-neighbor strategy works well.

The fourth batch of results presents our architecture and deserves
an in-depth discussion. By using only the tweets, our LSTM system
provides a classification accuracy of 95.53% (cf. “LSTM (Tweet-only
+ 50D GloVE)”). This result is very promising: the use of our
architecture yields a lift in performance on the order of 5% even when
using just the tweet texts.

Following the intuition that metadata can enrich the information
available for classification purposes, the last four systems
show the results of our Contextual LSTM combining metadata and
tweet text features. Our Contextual LSTM shows a boost in performance,
yielding a classification accuracy of 96.33% for the
best model with a 200-dimension GloVE embedding (cf. “Contextual
LSTM (200D GloVE)”). It is worth noting that different configurations
of GloVE, namely varying the dimensionality of the word embedding
space, do not affect the performance much: the general trend
suggests that higher dimensionality yields slightly
better performance. In conclusion, from our analysis we derived
that, even though metadata is shown to be a weak predictor for
our baseline results, our proposed architecture uses the extra
information provided by the metadata to yield slightly more accurate
prediction results. It is to be noted that the metadata has not been
oversampled for the Contextual LSTM systems.


4.3 Analysis 


Deep neural network models are often touted as uninterpretable 
black boxes. Although they provide impressive results in many prac- 
tical applications, various researchers, practitioners, policy makers, 
and funding agencies [26] demand attention to the matter of inter- 
pretability [48]. Recently, several studies aimed at shedding light 
on the inner states of recurrent models by visualizing their hidden 
state dynamics. Karpathy and collaborators, for example, examined 
inner states of language models [34], while Li et al. [38] introduced
strategies to help interpret deep neural network results in NLP
tasks. With this in mind, we here attempt to understand and interpret
the effectiveness of our methods by analyzing the hidden state
activations of our neural network model.

We first visualize the change in the representation, i.e., the output 
over time of the LSTM hidden units. We give examples of tweets 
generated by a human and by a bot in Figures 3 and 4 respectively. 
We observe significant differences in the activations between the 
examples, suggesting that the LSTM hidden units may correspond 
to different linguistic features. 

It has been observed by Cresci et al. [14] that using simple textual
features, as seen in prior work [42], proves ineffective against the
more sophisticated social bots we study in this work. We plot the
distributions of the final LSTM representations of genuine tweets
and bot-generated tweets in Figures 5 and 6 respectively. We see
that the majority of hidden units show a significant difference in
the distribution of activation values for genuine tweets versus
bot-generated tweets. For example, Unit 0 (bottom blue box) has a
broad activation range centered around -0.6 for tweets generated
by bots, as opposed to a narrow activation range centered around 0
for tweets generated by humans. Similarly, Unit 31 (top orange box)
is activated in the negative [-1,0] range for bot-generated tweets,
but again narrowly around 0 for genuine tweets. These examples
suggest the mechanism behind the ability of the LSTM to learn more
complex, non-linear features, which are effective in identifying
bot-generated tweets even from single observations.


4.4 Considerations and Limitations 


We developed bot detection systems requiring a minimal number of 
features as well as one single observation of a tweet generated by the 
account to be scrutinized. For the account-level bot detection, we 
showed that using oversampling and data enhancement techniques 
could yield models that perform nearly perfectly. 

Our main contribution, however, is presenting to the best of our 
knowledge the first bot detection model that detects the nature 
of an account from a single tweet. First, we find that simply us- 
ing the tweet metadata does not suffice. We also show that while 
simply using the content of the tweet works quite well with our pro- 
posed LSTM architecture, we can slightly improve the performance 
utilizing the information from the metadata. 


System | Precision | Recall | F1-Score | Accuracy | AUC/ROC
Logistic Regression | 0.94 | 0.93 | 0.93 | 0.9066 | 0.8891
SGD Classifier | 0.87 | 0.87 | 0.87 | 0.8726 | 0.8680
Random Forest Classifier | 0.98 | 0.98 | 0.98 | 0.9839 | 0.9845
AdaBoost Classifier | 0.98 | 0.98 | 0.98 | 0.9823 | 0.9823
2-layer NN (500,200,1) ReLU+Adam | 0.95 | 0.95 | 0.95 | 0.9496 | 0.9475
Logistic Regression (With SMOTENN) | 0.99 | 0.99 | 0.99 | 0.9859 | 0.9862
SGD Classifier (With SMOTENN) | 0.95 | 0.94 | 0.94 | 0.9433 | 0.9443
Random Forest Classifier (With SMOTENN) | 0.99 | 0.99 | 0.99 | 0.9937 | 0.9938
AdaBoost Classifier (With SMOTENN) | 1.00 | 1.00 | 1.00 | 0.9981 | 0.9981
2-layer NN (300,200,1) ReLU+Adam (With SMOTENN) | 0.99 | 0.99 | 0.99 | 0.9878 | 0.9879
Logistic Regression (With SMOTOMEK) | 0.92 | 0.91 | 0.91 | 0.9094 | 0.9098
SGD Classifier (With SMOTOMEK) | 0.90 | 0.90 | 0.90 | 0.9030 | 0.9031
Random Forest Classifier (With SMOTOMEK) | 0.99 | 0.99 | 0.99 | 0.9859 | 0.9859
AdaBoost Classifier (With SMOTOMEK) | 0.99 | 0.99 | 0.99 | 0.9865 | 0.9865
2-layer NN (300,200,1) ReLU+Adam (With SMOTOMEK) | 0.95 | 0.95 | 0.95 | 0.9391 | 0.9489


Table 2: Classification performance of various systems on the account-level (user) bot detection task. The first batch of systems 
represent traditional off-the-shelf baseline approaches, that already exhibit very accurate performance. The second and third 
batches of systems are enhanced by means of synthetic minority oversampling techniques, to illustrate how it is possible to 
achieve nearly perfect account-level bot detection without the need for complex deep architectures. For each batch of systems 
we highlighted the best accuracy and AUC/ROC performing ones: AdaBoost consistently provides the top (or nearly the top) 
performance across all account-level bot detection benchmarks. 


System | Precision | Recall | F1-Score | Accuracy | AUC/ROC
Logistic Regression (Metadata-only) | 0.80 | 0.80 | 0.79 | 0.8008 | 0.7633
SGD Classifier (Metadata-only) | 0.76 | 0.76 | 0.75 | 0.7625 | 0.7191
Random Forest Classifier (Metadata-only) | 0.80 | 0.80 | 0.80 | 0.8042 | 0.7765
AdaBoost Classifier (Metadata-only) | 0.80 | 0.80 | 0.79 | 0.7991 | 0.7618
Logistic Regression (Metadata-only + SMOTENN) | 0.92 | 0.92 | 0.92 | 0.9188 | 0.8820
SGD Classifier (Metadata-only + SMOTENN) | 0.91 | 0.90 | 0.90 | 0.8992 | 0.8860
Random Forest Classifier (Metadata-only + SMOTENN) | 0.92 | 0.92 | 0.92 | 0.9233 | 0.8806
AdaBoost Classifier (Metadata-only + SMOTENN) | 0.93 | 0.92 | 0.93 | 0.9234 | 0.9065
Logistic Regression (Metadata-only + SMOTOMEK) | 0.79 | 0.77 | 0.76 | 0.7666 | 0.7667
SGD Classifier (Metadata-only + SMOTOMEK) | 0.78 | 0.77 | 0.76 | 0.7664 | 0.7664
Random Forest Classifier (Metadata-only + SMOTOMEK) | 0.79 | 0.77 | 0.77 | 0.7747 | 0.7748
AdaBoost Classifier (Metadata-only + SMOTOMEK) | 0.79 | 0.77 | 0.77 | 0.7715 | 0.7716
LSTM (Tweet-only + 50D GloVE) | 0.96 | 0.96 | 0.96 | 0.9553 | 0.9567
Contextual LSTM (25D GloVE) | 0.96 | 0.96 | 0.96 | 0.9567 | 0.9585
Contextual LSTM (50D GloVE) | 0.96 | 0.96 | 0.96 | 0.9618 | 0.9627
Contextual LSTM (100D GloVE) | 0.96 | 0.96 | 0.96 | 0.9618 | 0.9626
Contextual LSTM (200D GloVE) | 0.96 | 0.96 | 0.96 | 0.9633 | 0.9643


Table 3: Classification performance of various systems, including the proposed Contextual LSTM, on the tweet-level bot
detection task. The first batch of systems represent traditional off-the-shelf baseline approaches: their accuracy and AUC/ROC
scores range between 71% and 80%. The second and third batches of systems are enhanced by means of synthetic minority
oversampling techniques, showing accuracy and AUC/ROC scores between 76% and 92%. The
LSTM architecture that we propose is presented in the fourth batch of systems: our results outperform the other approaches by
a significant margin, averaging above 96% accuracy and AUC/ROC scores, to demonstrate that tweet-level bot detection can
be achieved with extremely high accuracy, a small number of features, and limited-size training datasets. For each batch of
systems we highlighted the best Accuracy and AUC/ROC performing ones. Among the traditional machine learning models (i.e.,
excluding the proposed approach that significantly outperforms all the baselines and their variants using synthetic minority
oversampling), both Random Forest and AdaBoost classifiers appear to consistently deliver the best performance across all
tweet-level bot detection benchmarks.


5 RELATED WORK 


Bots (a.k.a. social bots, or sybil accounts) have been found guilty
of polluting social media conversations in a variety of scenarios.
Reports of online manipulation mediated by bots span political
conversation [6, 21, 31, 40, 59], fake news [19, 50], conspiracy
theories [53], stock market manipulation [18], public health [12], and
more [27]; it is worth noting that on some rare occasions bots have
been used to deliver positive interventions [46, 49], rather than for
nefarious purposes. The research community promptly responded
to the problem of the increasing pervasiveness of bots on platforms
like Twitter and Facebook. A wealth of strategies and frameworks
have been proposed to address the challenge of bot detection in
these environments. A recent review [20] proposed a taxonomy
describing three types of approaches: (a) methods based on social
network information; (b) systems based on crowd-sourcing and human
computation; (c) algorithms based on predictive features that separate
bots from humans. Our framework falls in the last category.


[Figure 3 heatmap: rows are LSTM units; columns are the tokens of the tweet “<user> your brain doesn't function right when you're hungry. so chow down”]

Figure 3: Representations over time of the activations of the 32 LSTM hidden units for a tweet generated by a human. Each column
corresponds to the LSTM outputs at one time step. Cells correspond to the 32 dimensions of the representation at each time step.


[Figure 4 heatmap: rows are LSTM units; columns are the tokens of the tweet “check out this trust me im a nerd mens sweatshirt - awesome <url>”]

Figure 4: Representations over time of the activations of the 32 LSTM hidden units for a tweet generated by a bot. Each column
corresponds to the LSTM outputs at one time step. Cells correspond to the 32 dimensions of the representation at each time step.

Bots exhibit great variability and diversity in terms of behavior,
capabilities, and intent: this was illustrated with a categorization
scheme proposed in another recent survey [43]. Another recent
white paper discussed the capabilities of bots powered by sophisticated
Artificial Intelligence [2]. Bots also attracted the attention of
the cyber security research community: sometimes, large groups
of bots are controlled by the same entity, called a bot master, acting
behind the scenes in a command-and-control fashion, in analogy
to traditional botnets used to deploy cyber attacks and other
cyber-security threats [25], as demonstrated on Twitter as well [1, 16].

Figure 5: Distribution of activation values of LSTM hidden units (of which there are 32) on tweets generated by bots.

Figure 6: Distribution of activation values of LSTM hidden units (of which there are 32) on tweets generated by genuine users.

Much work on bot detection assumes extensive access to social
media data. For example, Wang and collaborators used clustering
techniques to identify large-scale behavioral anomalies [56], while
other authors adopted supervised learning to analyze all accounts
of some platform and separate bots from humans [7, 17, 37, 60].
Although these approaches can be useful, for example to detect
large-scale bot infiltrations, they can be implemented exclusively
by social media service providers with full access to data and
systems. Some of them published studies showing the effectiveness of
some implementations, e.g., SybilRank [8], the Facebook Immune
System [52], and others [3]. To obviate the need for unlimited
data access, other techniques have been designed to require smaller
samples of user activity, and fewer labeled examples of bot and
human users. Examples of this trend include the classification system
proposed by Chu et al. [10, 11], the system based on crowd-sourcing
designed by Wang et al. [57], the detection techniques based on
NLP presented by Clark et al. [13], and BotOrNot [15].

To allow for bot detection at the user level, all these methods still
require the analysis of some historical user data, either by indirect
data collection [10, 11, 13, 57], or, as in the case of BotOrNot [15],
by interrogating the Twitter API (which imposes strict rate limits,
making it impossible to do large-scale bot detection). To the
best of our knowledge, no tweet-based detection system existed
prior to this work. We filled this gap by designing an LSTM deep
neural network based on combinations of textual features and user
metadata that is capable of determining whether a single tweet was
posted by a human or a bot with extremely high accuracy. The
same architecture shows nearly perfect accuracy in the user-level
bot detection task.
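
To make the data flow concrete, the idea of combining an LSTM encoding of the tweet text with user metadata can be sketched as a forward pass in plain Python. This is an illustrative sketch under stated assumptions, not the trained model: only the 32 hidden units come from the figures above, while the toy embedding size, the number of metadata features, the random untrained weights, the omission of bias terms, and the single logistic output unit are simplifications introduced here.

```python
import math
import random

HIDDEN = 32   # number of LSTM hidden units, as in Figures 3-6
EMBED = 4     # toy embedding size (assumption; the paper uses 25D-200D GloVe vectors)
N_META = 3    # toy number of metadata features (assumption)

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
# Random and untrained; bias terms are omitted for brevity.
GATES = {g: rand_matrix(HIDDEN, EMBED + HIDDEN) for g in ("i", "f", "o", "g")}
OUT_W = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN + N_META)]


def lstm_encode(embedded_tokens):
    """Run a standard LSTM cell over the token vectors; return the last hidden state."""
    h = [0.0] * HIDDEN
    c = [0.0] * HIDDEN
    for x in embedded_tokens:
        xh = x + h  # list concatenation: current input followed by previous state
        i = [sigmoid(z) for z in matvec(GATES["i"], xh)]    # input gate
        f = [sigmoid(z) for z in matvec(GATES["f"], xh)]    # forget gate
        o = [sigmoid(z) for z in matvec(GATES["o"], xh)]    # output gate
        g = [math.tanh(z) for z in matvec(GATES["g"], xh)]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def bot_probability(embedded_tokens, metadata):
    """Concatenate the tweet encoding with metadata and apply a logistic output unit."""
    features = lstm_encode(embedded_tokens) + list(metadata)
    return sigmoid(sum(w * v for w, v in zip(OUT_W, features)))


# Toy inputs: five random "token embeddings" and three scaled metadata values.
tweet = [[rng.uniform(-1, 1) for _ in range(EMBED)] for _ in range(5)]
meta = [0.2, 0.7, 0.0]
p = bot_probability(tweet, meta)
```

With untrained weights the value of `p` carries no meaning. The sketch only shows where the metadata enters: after the text has been reduced to a fixed-size vector, so a single tweet and its accompanying metadata suffice to produce a score.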


6 CONCLUSIONS 


Given the prevalence of sophisticated bots on social media platforms
such as Twitter, the need for improved, inexpensive bot detection
methods is apparent. We proposed a novel contextual LSTM
architecture that uses both tweet content and metadata to
detect bots at the tweet level. From a single tweet, our model
achieves an AUC exceeding 96%.

We show that the additional metadata information, though a
weak predictor of the nature of a Twitter account per se, when
exploited by the LSTM decreases the error rate by nearly 20%. In
addition, we propose methods based on synthetic minority
oversampling that yield a near-perfect user-level detection accuracy
(> 99% AUC). Both these methods use a minimal number of
features that can be obtained in a straightforward way from the
tweet itself and its metadata, while surpassing prior state of the art.
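
The error-rate reduction can be checked against Table 3 with a short calculation. The sketch below assumes that the fourth numeric column of the table is accuracy and that the comparison is between the tweet-only LSTM and a Contextual LSTM; the text does not state which rows are compared, so both the like-for-like 50D row and the best-performing 200D row are shown.

```python
def relative_error_reduction(baseline_score, improved_score):
    """Fraction by which the error (1 - score) shrinks."""
    baseline_error = 1.0 - baseline_score
    improved_error = 1.0 - improved_score
    return (baseline_error - improved_error) / baseline_error


# Accuracy values copied from Table 3 (column assignment is an assumption).
tweet_only_lstm = 0.9553        # LSTM (Tweet-only + 50D GloVE)
contextual_lstm_50d = 0.9618    # Contextual LSTM (50D GloVE)
contextual_lstm_200d = 0.9633   # Contextual LSTM (200D GloVE)

reduction_50d = relative_error_reduction(tweet_only_lstm, contextual_lstm_50d)
reduction_200d = relative_error_reduction(tweet_only_lstm, contextual_lstm_200d)
print(f"{reduction_50d:.1%}  {reduction_200d:.1%}")
```

Under these assumptions the reduction is roughly 14.5% against the 50D Contextual LSTM and roughly 17.9% against the 200D one, so the "nearly 20%" figure appears to correspond to the best contextual model.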

In the future, we plan to make our system open source, and to
implement a Web service (for example, an API) that lets the
research community perform tweet-level bot detection.
From a research standpoint, we plan to use the proposed
framework to scrutinize social media conversation in different contexts,
in order to determine the extent of the interference of bots with
public discourse, as well as to understand how their capabilities
and sophistication evolve over time.